What is the tidyverse?
The tidyverse is a collection of R packages that share a common design and cover data import, manipulation, visualization, and more.
library(tidyverse)
x = 1:3
y = 'a'
z = list(one = x, two = y)
x
y
z
str(z)
class(y)
Here is a function to show how easy it is to create your own.
my_sum_times_two <- function(x, y) {
  2 * sum(x, y)
}
my_sum_times_two(1, 2)
Vectors form the basis of R data structures. The two main types are atomic vectors and lists.
my_vector <- c(1, 2, 3) # atomic (numeric) vector
my_list <- list(a = 1, b = 2) # a named list
my_list
Data frames are a special kind of list, and probably the most commonly used data structure for data science purposes.
my_data = data.frame(
  id = 1:3,
  name = c('Vernon', 'Ace', 'Cora')
)
my_data
class(my_data)
Importing data is usually the first step.
demographics = read.csv('data/demos_anonymized.csv')
ids = read.csv('data/ids_anonymized.csv')
A database connection must be opened first, but after that its tables can be used much like data frames.
# requires the DBI, RSQLite, and dbplyr packages; just for demo
library(DBI)
con <- dbConnect(RSQLite::SQLite(), ":memory:")
# con
copy_to(con, demographics, 'demos')
A common step is to subset the data by column.
demographics %>%
  select(gender, age, libuser)
demographics %>%
  select(-libuser)
demographics %>%
  select(starts_with('award'))
To filter data, think of a logical statement, something that can be TRUE or FALSE.
my_filtered_data = demographics %>%
  filter(age < 40)
my_filtered_data = demographics %>%
  filter(libuser == 1)
Another very common data processing task is to generate new variables.
demographics = demographics %>%
  mutate(new_age = (age - mean(age, na.rm = TRUE)) / sd(age, na.rm = TRUE))
demographics = demographics %>%
  rename(age_std = new_age)
demographics %>%
  rename_all(toupper) %>%
  colnames()
Merging data can take on a variety of forms and, depending on the data, can be quite complicated.
# same N rows as demos
left_join(demographics, ids)
# only ~ 50k rows
inner_join(demographics, ids)
Use the : operator to select successive columns.
colnames(demographics)
demographics %>%
select(?)
Filter the data to award amounts less than 500000.
demographics %>%
filter(award_total_amount ?)
Generate a new award amount variable that is the log of the original. Give the new variable a useful name.
demographics %>%
mutate(? = log(?))
Using Python for data science is not far removed from R. Python’s main data processing library is pandas, which brings R-like data frames to the world of Python.
# note how when using something other than R, you have to specify the engine path
import pandas as pd
import numpy as np
demographics = pd.read_csv('data/demos_anonymized.csv')
ids = pd.read_csv('data/ids_anonymized.csv')
demographics.head() # show a few lines
# select by name
demographics[['age', 'award_total_amount']]
# select successive columns
demographics.loc[:,'libuser':'age']
# select by pattern
demographics.filter(regex='^award')
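The three selection styles above can be tried on a small made-up data frame. This is only a sketch: the toy `demos` frame, its values, and its column order are assumptions, not the workshop data. One thing it shows is that a label slice with `.loc` includes both endpoints, much like `:` inside `select()` in R.

```python
import pandas as pd

# Hypothetical toy data; the column order is an assumption for illustration.
demos = pd.DataFrame({
    'id': [1, 2],
    'libuser': [1, 0],
    'gender': ['F', 'M'],
    'age': [25, 41],
    'award_total_amount': [1000.0, 250000.0],
})

# Label slices with .loc include BOTH endpoints.
span = demos.loc[:, 'libuser':'age']

# Keep only columns whose names match a regular expression.
awards = demos.filter(regex='^award')

# Drop a column by name (comparable to select(-libuser) in R).
dropped = demos.drop(columns='libuser')

print(span.columns.tolist())
print(awards.columns.tolist())
print(dropped.columns.tolist())
```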
my_filtered_data = demographics[demographics.libuser == 1]
my_filtered_data.libuser.nunique()
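The filter above works by building a boolean Series and using it as a row mask. A minimal sketch with made-up values (the toy `demos` frame is hypothetical) shows how two conditions might be combined. Each condition needs its own parentheses because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical toy data standing in for the demographics table.
demos = pd.DataFrame({
    'age': [25, 41, 33, 58],
    'libuser': [1, 0, 1, 1],
})

# A comparison returns a boolean Series, one value per row.
is_libuser = demos['libuser'] == 1

# Combine conditions with & (and) or | (or), wrapping each in parentheses.
young_libusers = demos[(demos['age'] < 40) & is_libuser]

# The same filter written as a query string.
same_with_query = demos.query('age < 40 and libuser == 1')

print(young_libusers)
```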
# pandas skips missing values by default, and .std() uses n - 1 like R's sd()
demographics['new_age'] = (demographics['age'] - demographics['age'].mean()) / demographics['age'].std()
demographics.new_age.describe() # mean = 0 sd = 1
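One detail to keep in mind when standardizing: the pandas `std()` method defaults to the sample standard deviation (`ddof=1`), as R's `sd()` does, while NumPy's `np.std` on a plain array defaults to the population version (`ddof=0`). The sketch below uses made-up ages to show that the two choices give slightly different results.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, for illustration only.
age = pd.Series([20.0, 30.0, 40.0, 50.0])

# Standardize with the sample standard deviation (pandas default, ddof=1).
z_sample = (age - age.mean()) / age.std()

# Standardize with the population standard deviation (NumPy default, ddof=0).
z_pop = (age - age.mean()) / np.std(age.to_numpy())

# describe() and .std() both report the ddof=1 version.
sd_of_z_sample = z_sample.std()
sd_of_z_pop = z_pop.std()

print(sd_of_z_sample, sd_of_z_pop)
```

With only four values the gap is easy to see. With many rows it shrinks, but the reported sd will still not be exactly 1 if the two conventions are mixed.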
demographics = demographics.rename(columns={'new_age': 'age_std'})
demos_joined = pd.merge(demographics, ids, how='left', on='EMPLID')
demos_joined
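# note: DataFrame.join matches on the row index by default, not on EMPLID;
# lsuffix is appended to overlapping column names from the left data frame
demos_joined = demographics.join(ids, how='left', lsuffix='EMPLID')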
demos_joined.shape
demos_joined.columns
demos_joined = demographics.join(ids, how='inner', lsuffix='EMPLID')
demos_joined.columns
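To see how the join types above differ, here is a small sketch on two made-up tables. The `EMPLID` key name comes from the code above; the toy frames, their values, and the `dept` column are assumptions. It uses the `indicator=True` option of `pd.merge` to flag where each row came from, and shows one way `join` could be made to match on a key by moving that key into the index first.

```python
import pandas as pd

# Hypothetical toy tables; only the EMPLID key name is taken from the notes.
demos = pd.DataFrame({'EMPLID': [1, 2, 3], 'age': [25, 41, 33]})
ids = pd.DataFrame({'EMPLID': [1, 3, 4], 'dept': ['A', 'B', 'C']})

# Left join: keeps every row of demos; unmatched rows get missing values.
left = pd.merge(demos, ids, how='left', on='EMPLID', indicator=True)

# Inner join: keeps only keys present in both tables.
inner = pd.merge(demos, ids, how='inner', on='EMPLID')

# DataFrame.join aligns on the index, so set the key as the index first.
by_index = demos.set_index('EMPLID').join(ids.set_index('EMPLID'), how='inner')

print(left)
print(inner)
print(by_index)
```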